Decision

I have chosen to KEEP my Midterm as part of my final grade. The analysis below describes how I came to this conclusion. Keep in mind that my analysis builds a model that is used in the Conclusion tab to determine my predicted final test score, and knowing that predicted score allowed me to come to a logical decision. It is also important to note that my analysis uses the dataset provided by Bro. Saunders containing previous students' scores.

Analysis

Data

My first step is to look at the data to see what I’m working with and which variables are included. The datatable below shows a snippet of the data.

datatable(midterm, options=list(lengthMenu = c(5,10,25)))

From this snippet, a couple of things stand out. The data looks good for the most part. There are 9 variables, as shown below.

colnames(midterm) %>% 
  pander()

Gender, Midterm, FinalExam, AssessmentQuizCompletion, AssessmentQuizActual, AnalysesTheory, PeerReviews, ClassActivities and SkillQuizzes

Something that could cause issues later on is the missing values in the dataset. Rather than filter these rows out, let’s fill them in with 0s, since the other information in those rows may still be useful in the analysis. If anything looks off later, we can filter them out then.

An updated dataset is shown below:

midterm <- midterm %>% 
  replace(is.na(.), 0)
datatable(midterm, options=list(lengthMenu = c(5,10,25)))
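As a quick sanity check on this step, the same replace() call can be exercised on a toy data frame (a sketch only; the column values here are illustrative stand-ins, not the real data):

```r
# Minimal sketch of the NA-replacement step on a toy data frame.
library(dplyr)

toy <- data.frame(Midterm = c(80, NA, 95),
                  SkillQuizzes = c(NA, 70, 88))

# Same pattern as above: every NA anywhere in the frame becomes a 0.
toy <- toy %>%
  replace(is.na(.), 0)

sum(is.na(toy))   # 0 - no missing values remain
toy$Midterm[2]    # 0
```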




Determining the Model

With the dataset ready to go, let’s take a quick look at the initial pairs plot to see how the variables, including the Midterm variable, look in relation to the final test score. The first pairs plot is shown below:

pairs(midterm, panel = panel.smooth)

There are several variables that seem interesting; in particular, Gender, Midterm, AnalysesTheory, AssessmentQuizActual, and SkillQuizzes. The two that stand out most are Midterm and AnalysesTheory. I’m going to start with the Midterm variable, and since the data seem to have a curve to them, we’ll include a squared term now and see if it is significant.

Running the regression we get the following results:

lm1 <- lm(FinalExam+1~Midterm + I(Midterm^2), data = midterm)
  
summary(lm1)
## 
## Call:
## lm(formula = FinalExam + 1 ~ Midterm + I(Midterm^2), data = midterm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -67.141  -7.661   4.859  11.263  31.657 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.692345   8.859572   4.254 8.76e-05 ***
## Midterm      -0.059799   0.399867  -0.150    0.882    
## I(Midterm^2)  0.006058   0.004306   1.407    0.165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.85 on 52 degrees of freedom
## Multiple R-squared:  0.2841, Adjusted R-squared:  0.2566 
## F-statistic: 10.32 on 2 and 52 DF,  p-value: 0.0001682

The summary shows that neither Midterm term is significant. The fit of the model is also poor, with an R-squared of only 0.2841.

However, despite this, I’m curious. Let’s look at the residuals on the pairs plot to see if anything stands out.

pairs(cbind(R=lm1$res, Fit = lm1$fit, midterm), panel = panel.smooth)

AnalysesTheory and AssessmentQuizActual still stand out to me. Starting with AnalysesTheory: this looks like a simple linear relationship, so we’ll add it to the model, along with an interaction with Midterm.

We get the following results.

lm2 <- lm(FinalExam+1~Midterm + I(Midterm^2) + AnalysesTheory + Midterm:AnalysesTheory, data = midterm)
summary(lm2)
## 
## Call:
## lm(formula = FinalExam + 1 ~ Midterm + I(Midterm^2) + AnalysesTheory + 
##     Midterm:AnalysesTheory, data = midterm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -56.414  -4.956   4.758  10.371  25.465 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -6.867933  14.205293  -0.483 0.630868    
## Midterm                 0.044512   0.359632   0.124 0.901994    
## I(Midterm^2)            0.009103   0.004018   2.265 0.027849 *  
## AnalysesTheory          0.850762   0.228478   3.724 0.000499 ***
## Midterm:AnalysesTheory -0.007951   0.003285  -2.420 0.019179 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.38 on 50 degrees of freedom
## Multiple R-squared:  0.4727, Adjusted R-squared:  0.4306 
## F-statistic: 11.21 on 4 and 50 DF,  p-value: 1.44e-06

Okay, so it looks like our squared Midterm term is now significant; the intercept and the linear Midterm term are the only non-significant terms left. Let’s take a look at the pairs plot again and see if there is anything else.

pairs(cbind(R=lm2$res, Fit = lm2$fit, midterm), panel = panel.smooth)

From this pairs plot, I’m beginning to think the interaction should involve the squared Midterm term rather than Midterm itself. I’m just going to swap it in and see.

lm3 <- lm(FinalExam+1~Midterm + I(Midterm^2) * AnalysesTheory, data = midterm)
summary(lm3)
## 
## Call:
## lm(formula = FinalExam + 1 ~ Midterm + I(Midterm^2) * AnalysesTheory, 
##     data = midterm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.669  -4.310   4.591  11.066  25.383 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -5.871e+00  1.376e+01  -0.427 0.671519    
## Midterm                     -5.873e-01  3.933e-01  -1.493 0.141665    
## I(Midterm^2)                 1.699e-02  5.919e-03   2.870 0.006000 ** 
## AnalysesTheory               8.335e-01  2.191e-01   3.805 0.000388 ***
## I(Midterm^2):AnalysesTheory -9.787e-05  3.974e-05  -2.462 0.017287 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.35 on 50 degrees of freedom
## Multiple R-squared:  0.4747, Adjusted R-squared:  0.4326 
## F-statistic: 11.29 on 4 and 50 DF,  p-value: 1.319e-06

Okay, so as you can see, everything except the linear Midterm term is significant. Looking at the new pairs plot:

pairs(cbind(R=lm3$res, Fit = lm3$fit, midterm), panel = panel.smooth)

And AnalysesTheory still interests me. Let’s add a cubed AnalysesTheory term to the model and see what happens.

lm4 <- lm(FinalExam+1~Midterm + I(Midterm^2) * AnalysesTheory + I(AnalysesTheory^3), data = midterm)
summary(lm4)
## 
## Call:
## lm(formula = FinalExam + 1 ~ Midterm + I(Midterm^2) * AnalysesTheory + 
##     I(AnalysesTheory^3), data = midterm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -56.697  -4.429   2.118  10.725  28.774 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                  1.050e+00  1.408e+01   0.075  0.94081   
## Midterm                     -4.753e-01  3.910e-01  -1.215  0.23004   
## I(Midterm^2)                 1.685e-02  5.804e-03   2.903  0.00553 **
## AnalysesTheory               4.885e-01  2.930e-01   1.667  0.10181   
## I(AnalysesTheory^3)          3.606e-05  2.082e-05   1.732  0.08958 . 
## I(Midterm^2):AnalysesTheory -1.151e-04  4.021e-05  -2.861  0.00620 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.01 on 49 degrees of freedom
## Multiple R-squared:  0.505,  Adjusted R-squared:  0.4545 
## F-statistic: 9.997 on 5 and 49 DF,  p-value: 1.232e-06

So the cubed term is only marginally significant (p ≈ 0.09), and the linear AnalysesTheory term has also dropped out of significance. The new pairs plot:

pairs(cbind(R=lm4$res, Fit = lm4$fit, midterm), panel = panel.smooth)

Let’s take out one term at a time here. I’m going to throw out the linear Midterm term first.

lm5 <- lm(FinalExam+1~I(Midterm^2) + AnalysesTheory + I(AnalysesTheory^3) + I(Midterm^2):AnalysesTheory, data = midterm)
summary(lm5)
## 
## Call:
## lm(formula = FinalExam + 1 ~ I(Midterm^2) + AnalysesTheory + 
##     I(AnalysesTheory^3) + I(Midterm^2):AnalysesTheory, data = midterm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.279  -6.323   0.995  10.484  30.319 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  2.655e+00  1.408e+01   0.189 0.851219    
## I(Midterm^2)                 1.062e-02  2.744e-03   3.871 0.000315 ***
## AnalysesTheory               3.373e-01  2.665e-01   1.266 0.211489    
## I(AnalysesTheory^3)          4.024e-05  2.063e-05   1.951 0.056717 .  
## I(Midterm^2):AnalysesTheory -9.578e-05  3.714e-05  -2.579 0.012889 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.09 on 50 degrees of freedom
## Multiple R-squared:  0.4901, Adjusted R-squared:  0.4493 
## F-statistic: 12.01 on 4 and 50 DF,  p-value: 6.462e-07
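Dropping terms by eyeballing p-values can also be backed up with a formal nested-model F-test via anova(). A sketch on synthetic stand-in data (the real midterm dataset isn’t reproduced here, so the numbers are illustrative only):

```r
# Sketch: formal comparison of nested models with an F-test.
# Synthetic data; coefficients chosen arbitrarily for illustration.
set.seed(42)
d <- data.frame(Midterm        = runif(55, 40, 100),
                AnalysesTheory = runif(55, 0, 100))
d$FinalExam <- 20 + 0.01 * d$Midterm^2 + 0.5 * d$AnalysesTheory +
  rnorm(55, 0, 15)

full    <- lm(FinalExam ~ Midterm + I(Midterm^2) + AnalysesTheory, data = d)
reduced <- lm(FinalExam ~ I(Midterm^2) + AnalysesTheory, data = d)

# A large p-value here says the dropped linear Midterm term
# wasn't earning its keep.
anova(reduced, full)
```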

Now let’s take out the linear AnalysesTheory term:

lm6 <- lm(FinalExam+1~I(Midterm^2) + I(AnalysesTheory^3) + I(Midterm^2):AnalysesTheory, data = midterm)
summary(lm6)
## 
## Call:
## lm(formula = FinalExam + 1 ~ I(Midterm^2) + I(AnalysesTheory^3) + 
##     I(Midterm^2):AnalysesTheory, data = midterm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.243  -6.083   0.539  11.074  31.879 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  1.718e+01  8.208e+00   2.093 0.041369 *  
## I(Midterm^2)                 9.571e-03  2.631e-03   3.639 0.000639 ***
## I(AnalysesTheory^3)          5.807e-05  1.516e-05   3.831 0.000352 ***
## I(Midterm^2):AnalysesTheory -7.926e-05  3.497e-05  -2.266 0.027693 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.19 on 51 degrees of freedom
## Multiple R-squared:  0.4737, Adjusted R-squared:  0.4428 
## F-statistic:  15.3 on 3 and 51 DF,  p-value: 3.16e-07

Okay, all of our terms are now significant, but let’s take another look at the pairs plot:

pairs(cbind(R=lm6$res, Fit = lm6$fit, midterm), panel = panel.smooth)

It appears to me that AssessmentQuizActual has a pattern; let’s add it in now:

lm7 <- lm(FinalExam+1~I(Midterm^2) + I(AnalysesTheory^3) + I(Midterm^2):AnalysesTheory + AssessmentQuizActual + I(Midterm^2):AssessmentQuizActual + I(AssessmentQuizActual^3), data = midterm)
summary(lm7)
## 
## Call:
## lm(formula = FinalExam + 1 ~ I(Midterm^2) + I(AnalysesTheory^3) + 
##     I(Midterm^2):AnalysesTheory + AssessmentQuizActual + I(Midterm^2):AssessmentQuizActual + 
##     I(AssessmentQuizActual^3), data = midterm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.908  -7.905   1.535   8.362  34.567 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)                        1.694e+01  1.141e+01   1.484 0.144302
## I(Midterm^2)                       8.583e-03  3.040e-03   2.823 0.006904
## I(AnalysesTheory^3)                6.078e-05  1.521e-05   3.996 0.000221
## AssessmentQuizActual              -1.885e-02  3.050e-01  -0.062 0.950971
## I(AssessmentQuizActual^3)          2.178e-05  3.474e-05   0.627 0.533722
## I(Midterm^2):AnalysesTheory       -9.754e-05  3.558e-05  -2.742 0.008567
## I(Midterm^2):AssessmentQuizActual  2.952e-05  4.293e-05   0.688 0.494946
##                                      
## (Intercept)                          
## I(Midterm^2)                      ** 
## I(AnalysesTheory^3)               ***
## AssessmentQuizActual                 
## I(AssessmentQuizActual^3)            
## I(Midterm^2):AnalysesTheory       ** 
## I(Midterm^2):AssessmentQuizActual    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.57 on 48 degrees of freedom
## Multiple R-squared:  0.5398, Adjusted R-squared:  0.4823 
## F-statistic: 9.384 on 6 and 48 DF,  p-value: 8.249e-07

Well, it’s clear that AssessmentQuizActual does not add anything significant to the model. Let’s go back to the previous model.

lm8 <- lm(FinalExam+1~I(Midterm^2) + I(AnalysesTheory^3) + I(Midterm^2):AnalysesTheory, data = midterm)
summary(lm8)
## 
## Call:
## lm(formula = FinalExam + 1 ~ I(Midterm^2) + I(AnalysesTheory^3) + 
##     I(Midterm^2):AnalysesTheory, data = midterm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.243  -6.083   0.539  11.074  31.879 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  1.718e+01  8.208e+00   2.093 0.041369 *  
## I(Midterm^2)                 9.571e-03  2.631e-03   3.639 0.000639 ***
## I(AnalysesTheory^3)          5.807e-05  1.516e-05   3.831 0.000352 ***
## I(Midterm^2):AnalysesTheory -7.926e-05  3.497e-05  -2.266 0.027693 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.19 on 51 degrees of freedom
## Multiple R-squared:  0.4737, Adjusted R-squared:  0.4428 
## F-statistic:  15.3 on 3 and 51 DF,  p-value: 3.16e-07

And after one last look, I don’t see anything else that stands out to me.

pairs(cbind(R=lm6$res, Fit = lm6$fit, midterm), panel = panel.smooth)

I think that we have the right components of our model. Let’s take a look at the residuals vs. fitted plot to see what’s going on there.

plot(lm6, which =1)

This looks okay, but not great. Let’s try removing the outlier labeled 42 (the most extreme of the labeled points) to see if that helps improve our linearity and constant variance at all. A new lm summary and residuals vs. fitted plot are shown below.

midterm1 <- midterm %>% 
  filter(row_number() != 42)

lm8 <- lm(FinalExam+1~I(Midterm^2) + I(AnalysesTheory^3) + I(Midterm^2):AnalysesTheory, data = midterm1)
summary(lm8)
## 
## Call:
## lm(formula = FinalExam + 1 ~ I(Midterm^2) + I(AnalysesTheory^3) + 
##     I(Midterm^2):AnalysesTheory, data = midterm1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.511  -6.543   0.373   9.712  30.472 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  1.717e+01  7.187e+00   2.389  0.02072 *  
## I(Midterm^2)                 1.124e-02  2.340e-03   4.805 1.45e-05 ***
## I(AnalysesTheory^3)          5.809e-05  1.327e-05   4.376 6.15e-05 ***
## I(Midterm^2):AnalysesTheory -9.789e-05  3.096e-05  -3.162  0.00267 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.05 on 50 degrees of freedom
## Multiple R-squared:  0.5337, Adjusted R-squared:  0.5058 
## F-statistic: 19.08 on 3 and 50 DF,  p-value: 2.21e-08
b <- lm8$coefficients
plot(lm8, which =1)

Removing the outlier did help our constant variance. The linearity is still not great, but should be alright.

Everything looks pretty good, but just because I’m a curious soul, let’s try a Box-Cox to check for a possible transformation:

boxCox(lm8)

Well, it looks like we’ll stick with our current model, as the Box-Cox plot suggests no transformation is needed.
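The boxCox() call above comes from the car package. For reference, the bundled MASS::boxcox performs a similar profile-likelihood check; a sketch on synthetic stand-in data (a λ-hat near 1 is what “no transformation” looks like):

```r
# Sketch: Box-Cox transformation check with MASS::boxcox (ships with R);
# car::boxCox produces a similar profile-likelihood curve.
# Synthetic data, illustrative only.
library(MASS)

set.seed(1)
d   <- data.frame(x = runif(60, 40, 100))
d$y <- 20 + 0.01 * d$x^2 + rnorm(60, 0, 10)

fit <- lm(y ~ I(x^2), data = d)        # response must be strictly positive
bc  <- boxcox(fit, plotit = FALSE)     # profile log-likelihood over lambda
lambda_hat <- bc$x[which.max(bc$y)]    # lambda near 1 => leave y untransformed
lambda_hat
```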

Therefore our model for predicting a final test score is:

\[ \underbrace{Y_i}_\text{Final Exam + 1} = 17.1690306 + 0.0112411\underbrace{X_{1i}^2}_\text{Midterm} + 5.8087358\times 10^{-5}\underbrace{X_{2i}^3}_\text{AnalysesTheory} - 9.7888963\times 10^{-5}\underbrace{X_{1i}^2X_{2i}}_\text{Interaction} \]




Conclusion

As determined in the “Determining the Model” tab, our model is:

\[ \underbrace{Y_i}_\text{Final Exam + 1} = 17.1690306 + 0.0112411\underbrace{X_{1i}^2}_\text{Midterm} + 5.8087358\times 10^{-5}\underbrace{X_{2i}^3}_\text{AnalysesTheory} - 9.7888963\times 10^{-5}\underbrace{X_{1i}^2X_{2i}}_\text{Interaction} \]

Using this model, I will now plug in my own scores to determine my predicted final score. I scored a 96 on my Midterm and currently have 100% on my Analyses.

p <- predict(lm3, data.frame(Midterm = 96, AnalysesTheory = 100))-1

My predicted Final Exam score is: 86.461289

Before going any further, let’s look at the prediction interval to see just how much this score can vary.

p_i <- predict(lm3, data.frame(Midterm = 96, AnalysesTheory = 100), interval = "prediction")-1

p_i %>% 
  pander()
 fit    lwr    upr
-----  -----  -----
86.46  48.86  124.1
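For context, predict() can return either a prediction interval (for a single new observation, wide) or a confidence interval (for the mean response, much narrower); the interval above is of the first, wider kind. A self-contained sketch on synthetic stand-in data:

```r
# Sketch: prediction vs. confidence intervals from predict.lm().
# Synthetic data; the contrast in interval width is the point.
set.seed(7)
d   <- data.frame(x = runif(55, 40, 100))
d$y <- 30 + 0.6 * d$x + rnorm(55, 0, 12)
fit <- lm(y ~ x, data = d)

new <- data.frame(x = 96)
pi <- predict(fit, new, interval = "prediction")  # wide: one new student
ci <- predict(fit, new, interval = "confidence")  # narrow: mean response
pi
ci
```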

Okay, the prediction interval shows that my final exam score can still vary quite a bit, so I should not put too much trust in this single prediction. However, let’s still run through the logic.

Fitted Prediction Logic

Assuming that I actually score an 86.461289/100 on my final:

Should I drop my Midterm, I would have an 86.461289 final percentage.

Should I keep my Midterm, I would have (70%)*86.461289 + (30%)*96 = 89.3229 final percentage.

Lower Prediction Logic

Assuming that I actually score a 48.8585981/100 on my final:

Should I drop my Midterm, I would have a 48.8585981 final percentage.

Should I keep my Midterm, I would have (70%)*48.8585981 + (30%)*96 = 63.00102 final percentage.
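The weighted-grade arithmetic above can be sketched as a small helper (the 70%/30% weighting is taken from the scenarios above):

```r
# Sketch: final percentage under both scenarios, assuming the final exam
# counts 70% and the midterm 30% when the midterm is kept.
weighted_grade <- function(final, midterm, keep_midterm = TRUE) {
  if (keep_midterm) 0.7 * final + 0.3 * midterm else final
}

weighted_grade(86.461289, 96)    # fitted prediction, midterm kept -> 89.3229
weighted_grade(48.8585981, 96)   # lower bound, midterm kept
weighted_grade(48.8585981, 96, keep_midterm = FALSE)  # midterm dropped
```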

Decision

Looking at both possibilities, the weighted score with the Midterm kept beats the score without it in each scenario, so I’d say it is in my best interest to keep my Midterm.